{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --user graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 12 - Decision Trees\n", "\n", "For this lab, we will use survey data collected by the city of [Somerville, MA](https://en.wikipedia.org/wiki/Somerville,_Massachusetts) asking residents about their happiness, as well as ratings of city services. \n", "\n", "The data is available from the UC Irvine Machine Learning Repository: [https://archive.ics.uci.edu/ml/datasets/Somerville+Happiness+Survey](https://archive.ics.uci.edu/ml/datasets/Somerville+Happiness+Survey)\n", "\n", "The link to download the data is [https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv](https://archive.ics.uci.edu/ml/machine-learning-databases/00479/SomervilleHappinessSurvey2015.csv)\n", "\n", "The data columns are:\n", "\n", "- D = decision attribute (D) with values 0 (unhappy) and 1 (happy) \n", "- X1 = the availability of information about the city services \n", "- X2 = the cost of housing \n", "- X3 = the overall quality of public schools \n", "- X4 = your trust in the local police \n", "- X5 = the maintenance of streets and sidewalks \n", "- X6 = the availability of social community events \n", "\n", "Attributes X1 to X6 have values 1 to 5." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from sklearn import tree\n", "import graphviz\n", "from graphviz import Source\n", " \n", "from sklearn.tree import export_graphviz\n", "import sklearn.metrics as met\n", "\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data into a dataframe. We have given the columns more descriptive names." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_column_names = [\"happy\", \"city_info\", \"housing_cost\", \"school_quality\",\n", "                    \"trust_police\", \"streets_sidewalks\", \"community_events\"]\n", "city = pd.read_csv(\"SomervilleHappinessSurvey2015.csv\",\n", "                   encoding=\"utf-16le\", names=new_column_names,\n", "                   header=0)\n", "city.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classwork\n", "\n", "The code below allows you to make your own decision tree. Which three conditions should you use to get the highest accuracy?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# top level of decision tree\n", "filter_level_1 = city[\"school_quality\"] < 4\n", "level_2_left = city[filter_level_1]\n", "level_2_right = city[~filter_level_1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# second level of decision tree on left\n", "filter_level_2_left = level_2_left[\"housing_cost\"] < 4\n", "level_3_left_left = level_2_left[filter_level_2_left]\n", "level_3_left_right = level_2_left[~filter_level_2_left]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# second level of decision tree on right\n", "filter_level_2_right = level_2_right[\"community_events\"] < 4\n", "level_3_right_left = level_2_right[filter_level_2_right]\n", "level_3_right_right = level_2_right[~filter_level_2_right]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make predictions: each leaf predicts its majority class\n", "# labels=[0, 1] keeps each confusion matrix 2x2 even if a leaf holds a single class\n", "\n", "proportion_1 = level_3_left_left[\"happy\"].sum()/level_3_left_left.shape[0]\n", "if (proportion_1 >= 0.5):\n", "    confusion_matrix_left_left = met.confusion_matrix(level_3_left_left[\"happy\"], np.ones(level_3_left_left.shape[0]), labels=[0, 1])\n", "else:\n", "    confusion_matrix_left_left = met.confusion_matrix(level_3_left_left[\"happy\"], np.zeros(level_3_left_left.shape[0]), labels=[0, 1])\n", "\n", "proportion_1 = level_3_left_right[\"happy\"].sum()/level_3_left_right.shape[0]\n", "if (proportion_1 >= 0.5):\n", "    confusion_matrix_left_right = met.confusion_matrix(level_3_left_right[\"happy\"], np.ones(level_3_left_right.shape[0]), labels=[0, 1])\n", "else:\n", "    confusion_matrix_left_right = met.confusion_matrix(level_3_left_right[\"happy\"], np.zeros(level_3_left_right.shape[0]), labels=[0, 1])\n", "\n", "proportion_1 = level_3_right_left[\"happy\"].sum()/level_3_right_left.shape[0]\n", "if (proportion_1 >= 0.5):\n", "    confusion_matrix_right_left = met.confusion_matrix(level_3_right_left[\"happy\"], np.ones(level_3_right_left.shape[0]), labels=[0, 1])\n", "else:\n", "    confusion_matrix_right_left = met.confusion_matrix(level_3_right_left[\"happy\"], np.zeros(level_3_right_left.shape[0]), labels=[0, 1])\n", "\n", "proportion_1 = level_3_right_right[\"happy\"].sum()/level_3_right_right.shape[0]\n", "if (proportion_1 >= 0.5):\n", "    confusion_matrix_right_right = met.confusion_matrix(level_3_right_right[\"happy\"], np.ones(level_3_right_right.shape[0]), labels=[0, 1])\n", "else:\n", "    confusion_matrix_right_right = met.confusion_matrix(level_3_right_right[\"happy\"], np.zeros(level_3_right_right.shape[0]), labels=[0, 1])\n", "\n", "# add the four leaf matrices to get the confusion matrix for the whole tree\n", "cm = confusion_matrix_left_left + confusion_matrix_left_right + confusion_matrix_right_left + \\\n", "    confusion_matrix_right_right\n", "\n", "tn, fp, fn, tp = cm.ravel()\n", "\n", "sensitivity = tp/(tp + fn)\n", "specificity = tn/(tn + fp)\n", "precision = tp/(tp + fp)\n", "accuracy = (tp + tn)/(tp + tn + fp + fn)\n", "\n", "print(\"Sensitivity:\",sensitivity)\n", "print(\"Specificity:\",specificity)\n", "print(\"Precision:\", precision)\n", "print(\"Accuracy:\",accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting a decision tree with scikit-learn\n", "\n", "We can select just the independent variables (the x's) using the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"city.iloc[:,1:7]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create the decision tree classifier object and fit it to our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = tree.DecisionTreeClassifier(max_depth = 2)\n", "clf = clf.fit(city.iloc[:,1:7], city[\"happy\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are running Jupyter on your own computer, you may be able to display the decision tree with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree.plot_tree(clf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are using the JupyterHub server, run the following code instead (it will give an error):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dot_data = tree.export_graphviz(clf, out_file=None) \n", "graph = graphviz.Source(dot_data) \n", "graph.render(\"happiness.dot\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Despite the error, there should now be a file called happiness.dot in your directory. To view the fitted decision tree, open the happiness.dot file in Jupyter and copy the text. Paste this text into the text box at [http://www.webgraphviz.com](http://www.webgraphviz.com) and click the \"Generate graph!\" button at the bottom.\n", "\n", "In the graph, the column names appear as `X[0], X[1], ..., X[5]`. Run the following code to replace `X[0], X[1], ..., X[5]` with the column names and save the result as happiness_fixed.dot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# copy happiness.dot to happiness_fixed.dot, swapping X[i] for the column names\n", "with open(\"happiness.dot\", \"r\") as fin:\n", "    with open(\"happiness_fixed.dot\", \"w\") as fout:\n", "        for line in fin.readlines():\n", "            line = line.replace(\"X[0]\",\"city_info\")\n", "            line = line.replace(\"X[1]\",\"housing_cost\")\n", "            line = line.replace(\"X[2]\",\"school_quality\")\n", "            line = line.replace(\"X[3]\",\"trust_police\")\n", "            line = line.replace(\"X[4]\",\"streets_sidewalks\")\n", "            line = line.replace(\"X[5]\",\"community_events\")\n", "            fout.write(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Copy the contents of happiness_fixed.dot into the text box at [http://www.webgraphviz.com](http://www.webgraphviz.com) to display the decision tree with the column names. How does it compare to the decision tree you made?\n", "\n", "To make predictions, we can use the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = clf.predict(city.iloc[:,1:7])\n", "predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute the confusion matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "met.confusion_matrix(city[\"happy\"], predictions)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract the true negatives, false positives, false negatives, and true positives, and compute the same metrics as before:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tn, fp, fn, tp = met.confusion_matrix(city[\"happy\"], predictions).ravel()\n", "\n", "sensitivity = tp/(tp + fn)\n", "specificity = tn/(tn + fp)\n", "precision = tp/(tp + fp)\n", "accuracy = (tp + tn)/(tp + tn + fp + fn)\n", "\n", "print(\"Sensitivity:\",sensitivity)\n", "print(\"Specificity:\",specificity)\n", "print(\"Precision:\", precision)\n", "print(\"Accuracy:\",accuracy)" ] } ], "metadata": { "kernelspec": { 
"display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.8" } }, "nbformat": 4, "nbformat_minor": 2 }